
RAG Evaluation Benchmarking

This page benchmarks the quality of RAG evaluation with DynamoEval against RAGAS on a benchmark dataset. The table below summarizes the results of this notebook (F1 scores; higher is better).

F1            Retrieval Relevance    Faithfulness    Response Relevance
DynamoEval    0.89                   0.89            0.88
RAGAS         0.82                   0.81            0.72

Dataset

We modify the multidoc2dial dataset to construct positive and negative examples for binary classification along each of three dimensions: retrieval relevance (relevant vs. not relevant), faithfulness (faithful vs. not faithful), and response relevance (relevant vs. not relevant). The resulting dataset contains the question, context, response, and binary labels for retrieval relevance, faithfulness, and response relevance.

import pandas as pd
df = pd.read_csv("multidoc2dial-samples-with-labels.csv")
questions = [str(x) for x in df["queries"]]
responses = [str(x) for x in df["response"]]
contexts = [[str(x)] for x in df["context"]]
ret_gt_labels = list(df["retrieval_label"])
faith_gt_labels = list(df["faithful_label"])
res_gt_labels = list(df["response_label"])
df.head()
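
As a quick sanity check, we can look at how balanced the positive and negative examples are for each of the three tasks. The snippet below is illustrative and assumes only the label column names used in the loading code above.

# Inspect the positive/negative balance of each binary label
for col in ["retrieval_label", "faithful_label", "response_label"]:
    print(col, df[col].value_counts().to_dict())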

Metric

We compute the F1 score, which measures classification performance. The closer the value is to 1, the better.
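
For reference, here is a minimal worked example of the F1 computation with scikit-learn; the labels and predictions are hypothetical, made up purely for illustration.

from sklearn.metrics import f1_score

# Hypothetical ground-truth labels and binary predictions (illustrative only)
gt = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]
# F1 is the harmonic mean of precision and recall; here precision = recall = 0.75
print(f1_score(gt, pred))  # 0.75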

Setup

Set up some helper functions before running the evaluation. These methods help compute the performance metrics and post-process the output from the evaluation methods.

import numpy as np
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support

# Method to compute FPR, TPR, AUROC values
def get_roc(pred, gt_labels):
    fpr, tpr, thresholds = metrics.roc_curve(gt_labels, pred, pos_label=1)
    roc_auc = metrics.auc(fpr, tpr)
    return fpr, tpr, roc_auc

# Method to find the classification threshold that maximizes F1 and report the resulting metrics
def get_opt_thres_acc(result):
    output = {}
    for k in ["retrieval-relevance", "response-relevance", "faithfulness"]:
        gt_labels = result[f"gt-labels-{k}"]
        pred = result[f"{k}"]
        fpr, tpr, thresholds = metrics.roc_curve(gt_labels, pred, pos_label=1)
        max_f1 = 0
        opt_thr = 0
        for t in thresholds:
            y_pred = [1 if x >= t else 0 for x in pred]
            _, _, f1, _ = precision_recall_fscore_support(gt_labels, y_pred, average="binary")
            if f1 > max_f1:
                max_f1 = f1
                opt_thr = t
        y_pred = [1 if x >= opt_thr else 0 for x in pred]
        precision, recall, f1, _ = precision_recall_fscore_support(gt_labels, y_pred, average="binary")
        accuracy = np.where(np.array(y_pred) - np.array(gt_labels) == 0)[0].shape[0] / len(y_pred)
        output[k] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "accuracy": accuracy,
            "thrs": opt_thr,
        }
    return output

# Method to filter out nan output values: those with json parsing errors will not be considered for the evaluation
def filter_nan_vals(scores, gt_labels):
    filtered_scores, filtered_gt_labels = [], []
    for l, gl in zip(scores, gt_labels):
        if not np.isnan(l) and l != -1:  # keep only valid scores (not nan and not the -1 error value)
            filtered_scores.append(l)
            filtered_gt_labels.append(gl)
    return filtered_scores, filtered_gt_labels
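
As a toy illustration of the filtering helper (with made-up values), a nan score, such as one produced by a JSON parsing failure, is dropped together with its ground-truth label, and so is a -1 error value:

scores, labels = filter_nan_vals([0.9, float("nan"), 0.2, -1], [1, 0, 0, 1])
print(scores, labels)  # [0.9, 0.2] [1, 0]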

Run RAGAS

RAGAS provides the metrics faithfulness, answer_relevancy, and context_relevancy, which are analogous to faithfulness, response relevance, and retrieval relevance, respectively.

from datasets import Dataset
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from ragas import evaluate
import numpy as np

# Format the dataset input for RAGAS
dataset = Dataset.from_dict({"question": questions, "contexts": contexts, "answer": responses})

# Run RAGAS on context-relevancy, faithfulness, and answer-relevancy
results = evaluate(
    dataset,
    metrics=[
        context_relevancy,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,  # ignore json output format errors and return nan for them
)
ragas_df = results.to_pandas()
context_relevance_scores = list(ragas_df["context_relevancy"])
faithfulness_scores = list(ragas_df["faithfulness"])
response_relevance_scores = list(ragas_df["answer_relevancy"])

# filter out nan output values: those with json parsing errors will not be considered for the evaluation
filtered_context_relevance_scores, filtered_ret_gt_labels = filter_nan_vals(context_relevance_scores, ret_gt_labels)
filtered_faithfulness_scores, filtered_faith_gt_labels = filter_nan_vals(faithfulness_scores, faith_gt_labels)
filtered_response_relevance_scores, filtered_res_gt_labels = filter_nan_vals(response_relevance_scores, res_gt_labels)

# RAGAS results
result_ragas = {
    "retrieval-relevance": filtered_context_relevance_scores,
    "faithfulness": filtered_faithfulness_scores,
    "response-relevance": filtered_response_relevance_scores,
    "gt-labels-retrieval-relevance": filtered_ret_gt_labels,
    "gt-labels-faithfulness": filtered_faith_gt_labels,
    "gt-labels-response-relevance": filtered_res_gt_labels,
}

# compute precision, recall, and f1 scores based on the optimal threshold for the raw prediction scores
get_opt_thres_acc(result_ragas)
# output
{'retrieval-relevance': {'precision': 0.8108108108108109,
  'recall': 0.8333333333333334,
  'f1': 0.8219178082191781,
  'accuracy': 0.8266666666666667,
  'thrs': 0.7493216356590829},
 'response-relevance': {'precision': 0.5806451612903226,
  'recall': 0.9473684210526315,
  'f1': 0.72,
  'accuracy': 0.6410256410256411,
  'thrs': 0.14285714285714285},
 'faithfulness': {'precision': 0.7272727272727273,
  'recall': 0.9142857142857143,
  'f1': 0.810126582278481,
  'accuracy': 0.7794117647058824,
  'thrs': 0.1}}

Run DynamoEval

DynamoEval provides the SDK methods retrieval_relevance_judge_text, response_relevance_judge_text, and faithfulness_judge_text, which compute scores, labels, and explanations for retrieval relevance, response relevance, and faithfulness, respectively. Before running them, we first need to set a Mistral API key as an environment variable.

import os
os.environ["MISTRAL_API_KEY"] = "<your-mistral-api-key>"
from dynamofl.evaluate import (
    retrieval_relevance_judge_text,
    response_relevance_judge_text,
    faithfulness_judge_text
)

# Run DynamoEval
ret_result = retrieval_relevance_judge_text(questions, ["\n".join(x) for x in contexts])
faith_result = faithfulness_judge_text(questions, ["\n".join(x) for x in contexts], responses)
res_result = response_relevance_judge_text(questions, responses)
# get labels
ret_labels = ret_result["labels"]
faith_labels = faith_result["labels"]
res_labels = res_result["labels"]

# Filter out nan output values: those with json parsing errors will not be considered for the evaluation
filtered_retrieval_relevance_score, filtered_ret_gt_labels = filter_nan_vals(ret_labels, ret_gt_labels)
filtered_faithfulness_scores, filtered_faith_gt_labels = filter_nan_vals(faith_labels, faith_gt_labels)
filtered_response_relevance_scores, filtered_res_gt_labels = filter_nan_vals(res_labels, res_gt_labels)

# DynamoEval results
result_dynamoeval = {
    "retrieval-relevance": filtered_retrieval_relevance_score,
    "faithfulness": filtered_faithfulness_scores,
    "response-relevance": filtered_response_relevance_scores,
    "gt-labels-retrieval-relevance": filtered_ret_gt_labels,
    "gt-labels-faithfulness": filtered_faith_gt_labels,
    "gt-labels-response-relevance": filtered_res_gt_labels,
}

# print precision, recall, f1 scores
print(precision_recall_fscore_support(result_dynamoeval['gt-labels-retrieval-relevance'], filtered_retrieval_relevance_score, average="binary"))
print(precision_recall_fscore_support(result_dynamoeval['gt-labels-faithfulness'], filtered_faithfulness_scores, average="binary"))
print(precision_recall_fscore_support(result_dynamoeval['gt-labels-response-relevance'], filtered_response_relevance_scores, average="binary"))
# output
(0.8780487804878049, 0.9, 0.888888888888889, None)
(0.9444444444444444, 0.85, 0.8947368421052632, None)
(0.875, 0.875, 0.875, None)
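
To reproduce the summary table at the top of this page, the F1 scores from both runs can be collected into a single DataFrame. This is a minimal sketch that assumes the result_ragas and result_dynamoeval dictionaries built above; the names ragas_f1, dynamo_f1, and summary are introduced here for illustration.

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# RAGAS outputs continuous scores, so F1 comes from the optimal-threshold search above
ragas_f1 = {k: v["f1"] for k, v in get_opt_thres_acc(result_ragas).items()}

# DynamoEval outputs binary labels, so F1 is computed directly
dynamo_f1 = {}
for k in ["retrieval-relevance", "faithfulness", "response-relevance"]:
    _, _, f1, _ = precision_recall_fscore_support(
        result_dynamoeval[f"gt-labels-{k}"], result_dynamoeval[k], average="binary"
    )
    dynamo_f1[k] = f1

summary = pd.DataFrame([dynamo_f1, ragas_f1], index=["DynamoEval", "RAGAS"]).round(2)
print(summary)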

Qualitative Examples

We now take a look at some qualitative examples.

Retrieval Relevance

example_id = 70
print("[Question]", questions[example_id])
print("[Context]", contexts[example_id][0])
print("Ground-truth Label", ret_gt_labels[example_id])
print("RAGAS Score", context_relevance_scores[example_id])
print("Dynamoeval Label", ret_labels[example_id])

[Question] User says " I haven't lost the child's birth certificate." Agent says " Unfortunately, no relevant information is found." User says " But my child lived outside the United States for quite some time." Agent says " In that case, you'll need to provide a document that proves that your child lived in another country and does not have a social security number. You could use records like the child's school records or employment records." User says " How can I guarantee that my personal Social Security account will not be shared by the agency?"

[Context] POW in Japan , or Worked as an x - ray technician, in a reactor plant, or in nuclear medicine or radiography while on active duty or during active or inactive duty for training in the Reserves, or Did tasks like those of a Department of Energy (DOE) employee that make them a member of the Special Exposure Cohort (See 42 U.S.C. 7384L(14)) You may also qualify for disability benefits if you served in at least one of the below locations and capacities. You were: Part of underground nuclear weapons testing at Amchitka Island, Alaska, or Assigned to a gaseous diffusion plant at Paducah, Kentucky , or Assigned to a gaseous diffusion plant at Portsmouth, Ohio , or Assigned to a gaseous diffusion plant at Area K-25 at Oak Ridge, Tennessee

Ground-truth Label 0

RAGAS Score 1.0

DynamoEval Label 0

Faithfulness

example_id = 47
print("[Context]", contexts[example_id][0])
print("[Response]", responses[example_id])
print("Ground-truth Label", faith_gt_labels[example_id])
print("RAGAS Score", faithfulness_scores[example_id])
print("DynamoEval Label", faith_labels[example_id])

[Context] Your personal mySocial Security account is for your use only Social Security is dedicated to protecting the information and resources entrusted to us, including your personal information and investment. For your protection , you re the only one who can create your personal my Social Security account for your own exclusive use. No one can create or use an account on your behalf, even with written permission. Don t share the use of your account with anyone else under any circumstances, as unauthorized use of this service is a misrepresentation of your identity to the federal government and could subject you to criminal or civil penalties, or both.

[Response] That's right. Under certain circumstances, you can get a disability compensation for illnesses including those derived by contact with radiation during military service.

Ground-truth Label 0

RAGAS Score 0.8

DynamoEval Label 0

Response Relevance

example_id = 60
print("[Question]", questions[example_id])
print("[Response]", responses[example_id])
print("Ground-truth Label", res_gt_labels[example_id])
print("RAGAS Score", response_relevance_scores[example_id])
print("DynamoEval Label", res_labels[example_id])

[Question] User says " Good afternoon. I was reading about possible effects of certain radiations on veterans. That could be worth of a disability benefit?"

[Response] You can download our military sexual trauma brochure for Veterans : In English PDF ,or En espanol PDF

Ground-truth Label 0

RAGAS Score 0.8081080961086075

DynamoEval Label 0